ABSTRACT

Humans may be exposed to whole-body vibration in envi- 
ronments where clear speech communications are crucial, 
particularly during the launch phases of space flight and in 
high-performance aircraft. Prior research has shown that high 
levels of vibration cause a decrease in speech intelligibility. 
However, the effects of whole-body vibration upon speech 
are not well understood, and no attempt has been made to 
restore speech distorted by whole-body vibration. In this 
paper, a model for speech under whole-body vibration is pro- 
posed and a method to remove its effect is described. The 
method described reduces the perceptual effects of vibration, 
yields higher ASR accuracy scores, and may significantly 
improve intelligibility. Possible applications include
incorporation within communication systems to improve radio
communication in environments such as spaceflight,
aviation, or off-road vehicle operations.

Index Terms — Whole-Body Vibration, Speech Intelligi- 
bility 

1. INTRODUCTION 

Speech production is inhibited when humans are exposed to 
whole-body vibration between 2 and 20 Hz [1]. Examples of 
environments where humans are exposed to these vibration 
levels include spacecraft, high-performance aircraft, military 
land vehicles, and heavy machinery such as tractors. In these 
contexts clear speech communications are crucial; in partic- 
ular, speech intelligibility for radio communications between 
crew and ground control is of concern during launch phases 
of space flight because other means of communication such as 
operation of manual controls are extremely difficult if not im- 
possible. NASA standards require speech intelligibility levels 
to be equivalent to 90% word identification [2], but prior
research has shown that speech under whole-body vibration is
at least 9% less intelligible than speech in non-vibrated con- 
ditions [3]. Even in situations where intelligibility remains 
high, “distortions of the speech signal will increase
listening effort and fatigue, and reduce speech quality to the point
where communication becomes difficult and annoying” [4].

Fig. 1. Speaker positioned in the semi-supine position on an
experimental vibration platform.

NASA has addressed the need for developing analytic 
models for human vibration response in order to predict the 
effects on manual performance and speech production [4]. 
Previous studies [5, 6, 7, 8, 9] have examined the physical 
effects of whole-body vibration on mechanisms of the vocal 
production system and examined the distortion of the speech 
signal. These studies found that vibration between 2 and 
20 Hz causes disruptions in airflow which in turn cause fre- 
quency and amplitude modulations in the resulting speech. 
However, no model for speech under whole-body vibration 
has been proposed, and no effort has been made to address 
this reduction in intelligibility. It is generally difficult or im- 
possible to remove the vibration itself, warranting methods 
to improve speech intelligibility that do not involve chang- 
ing the vibration environment. There has been no previous 
research on removing the vibration effect from the speech 
signal directly. 

Whole-body vibration is defined by Griffin [1] as occur- 
ring “when the body is supported on a surface which is vi- 
brating”, such as when sitting on a vibrating seat, standing on 
a vibrating floor, or lying on a vibrating bed. The studies
addressed in this paper consider speakers positioned in the
face-up recumbent (semi-supine) position affected by sinusoidal
vibration in the body’s x-axis (back to chest) as shown in
Figure 1. We examine sinusoidal vibrations with constant
frequency because they approximate the vibration present in
spaceflight environments. Sinusoidal vibration levels are
characterized in this paper by frequency (Hz) and zero-to-peak
acceleration amplitude (measured in units of Earth’s gravity, g).
We focus on the communication channel between a speak- 
ing crew member (exposed to vibration) and a listener in a 
ground control scenario (not exposed to vibration). Unlike 
more common noise reduction problems, the goal is not to 
remove background artifacts, but to remove distortion from 
the source itself. This paper proposes a model for vibrated 
speech and presents a method to remove or reduce the effects 
of vibration on the speech signal to improve speech quality 
and intelligibility. 


2. A MODEL FOR VIBRATED SPEECH 


A study similar in setup to [9] was conducted in the Hu- 
man Vibration Laboratory at NASA Ames Research Center. 
Speech samples of sustained phonemes and sentences were
gathered from 6 speakers at 4 vibration conditions¹. The
model proposed here is motivated by analysis of this data and 
results from similar studies such as [5, 6, 7, 9]. 

The primary observed characteristic of the data was that
the fundamental frequency, energy, and formant frequencies
of vibrated speech oscillate as a function of the vibration
acceleration. An example of these characteristics can be seen
in Figure 2. Our model for vibrated speech is based upon
the source-system model for speech production, where each
short time frame $s_{\hat n}[n]$ of a speech signal $s[n]$ is modeled as
the output of a filter with coefficients
$\alpha_{\hat n} = \{\alpha_{\hat n}(k)\}_{k=1}^{p}$ and
source “excitation” $e_{\hat n}[n]$. Pitch and energy oscillations are
modeled as modulations of the excitation $e[n]$, and formant
frequency oscillations are modeled as modulations of the
filter coefficients $\alpha$. Under the assumption that the vibration is
sinusoidal with known constant frequency $f_v$, we assume the
source excitation is amplitude modulated by the function

$$M_a(t) = A \sin(2\pi f_v (t + k)) + B \qquad (1)$$

and frequency modulated by

$$M_f(t) = t - \frac{D}{2\pi f_v} \cos(2\pi f_v (t + h)) \qquad (2)$$

such that $0 < A < B$, $0 < D < 1$, and $h, k \in \mathbb{R}$.

The models for $M_a(t)$ and $M_f(t)$ are based on the
premise that the airflow quantity passing through the vocal
tract during whole-body vibration is proportional to the
acceleration acting on the body. Oscillatory quantities of
airflow passing through the vocal tract cause an effect (similar to
musical vibrato) in which energy and frequency of the voice
oscillate along with the airflow. Note that this formulation
is different from traditional AM/FM modulation in the sense
that the roles of the carrier and modulator are reversed.

Fig. 2. Effect of vibration on a sustained vowel [o] at 12 Hz
vibration. The waveform and spectrogram of the vibrated
vowel (top, upper middle) are shown, along with the
frequency response of the source-system filter over time (lower
middle) and the spectrogram of the excitation $e[n]$ (bottom).
Sustained vibrated phonemes sound somewhere between a
very wide musical vibrato and a bleating goat.

The analog source excitation $\varepsilon(t)$ is decomposed as a sum
of sinusoids as in [10], such that

$$\varepsilon(t) = \sum_{i=1}^{L} a_i \sin(2\pi f_i t + \phi_i) \qquad (3)$$

where the parameters $a_i$, $f_i$, $\phi_i$ are respectively the
amplitude, frequency and phase of sinusoid $i$, and $L$ is the number
of sinusoids in the decomposition. The resulting vibrated
excitation $e[n]$ is given by

$$e[n] = M_a(t_n) \cdot \varepsilon(M_f(t_n)) \qquad (4)$$

where $t_n$ is a sequence of time samples. Finally, the model
for vibrated speech can be written as

$$s_{\hat n}[n] = \sum_{k=1}^{p} \tilde{\alpha}_{\hat n}(k)\, s_{\hat n}[n-k] + e_{\hat n}[n] \qquad (5)$$

where $\tilde{\alpha}_{\hat n}$ is a vector of modulated filter coefficients.

¹ 8 Hz, 0.5 g; 12 Hz, 0.5 g; 12 Hz, 0.7 g; and 16 Hz, 0.5 g.
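As a rough illustration of Equations (1) through (4), the following sketch synthesizes a vibrated excitation from a sum-of-sinusoids source. The sampling rate, the two-sinusoid toy source, and the parameter values are assumptions chosen for illustration only, and the function name `vibrated_excitation` is hypothetical.

```python
import numpy as np

def vibrated_excitation(t, amps, freqs, phases, f_v, A, B, D, k=0.0, h=0.0):
    """Sample e[n] = M_a(t_n) * eps(M_f(t_n)) for a sum-of-sinusoids source."""
    t = np.asarray(t, dtype=float)
    # Equation (1): amplitude modulator, strictly positive when 0 < A < B.
    m_a = A * np.sin(2 * np.pi * f_v * (t + k)) + B
    # Equation (2): warped time axis; its derivative is 1 + D*sin(...) > 0.
    m_f = t - D / (2 * np.pi * f_v) * np.cos(2 * np.pi * f_v * (t + h))
    # Equation (3), evaluated on the warped time axis.
    eps = sum(a * np.sin(2 * np.pi * f * m_f + p)
              for a, f, p in zip(amps, freqs, phases))
    # Equation (4): amplitude modulation applied to the warped source.
    return m_a * eps

fs = 16000.0                     # assumed sampling rate (Hz)
t = np.arange(int(fs)) / fs      # one second of time samples t_n
amps, freqs, phases = [1.0, 0.5], [120.0, 240.0], [0.0, 0.3]  # toy source
e = vibrated_excitation(t, amps, freqs, phases, f_v=12.0, A=0.3, B=1.0, D=0.05)
```

Setting `A = 0` and `D = 0` (outside the model's stated constraints, but convenient as a check) reduces the output to `B` times the unmodulated source.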

2.1. Parameter Estimation 

Given an observed vibrated excitation, the parameters for
the model can be estimated. The amplitude modulation
parameters are chosen by fitting the observed data to the
vibrated excitation model in Equation (4). However, the
frequency-modulated source excitation $\varepsilon(M_f(t_n))$ is
complex. $\varepsilon(M_f(t_n))$ is simplified for this stage of parameter
estimation to be a random process $w[n]$, such that for each $n$
the expected value $E(w[n]) = 0$ and $E(w[n]^2) = 1$. Then
the vibrated excitation can be approximated by

$$e[n] \approx \left(A \sin(2\pi f_v (t_n + k)) + B\right) \cdot w[n] \qquad (6)$$

Let $y[n]$ be an observed vibrated excitation. The parameters
$A$, $B$, and $k$ are chosen to minimize the expected value
of the sum of the squared distances between $|y[n]|$ and $|e[n]|$.
If the sum is taken over $N = j f_s / f_v$ points, where $j \in \mathbb{N}$ and
$f_s$ is the sampling frequency, the parameters can be solved
for analytically. Up to a multiplicative constant, the optimal
parameters are

$$A = \frac{2}{N} \sqrt{\left(\sum_{n=1}^{N} |y[n]| \sin(2\pi f_v t_n)\right)^2 + \left(\sum_{n=1}^{N} |y[n]| \cos(2\pi f_v t_n)\right)^2}$$

$$B = \frac{1}{N} \sum_{n=1}^{N} |y[n]|$$

$$k = \frac{1}{2\pi f_v} \arctan\left(\frac{\sum_{n=1}^{N} |y[n]| \cos(2\pi f_v t_n)}{\sum_{n=1}^{N} |y[n]| \sin(2\pi f_v t_n)}\right)$$

The frequency modulation parameters are chosen by fitting
the amplitude de-modulated signal to the frequency modulated
excitation model:

$$\varepsilon(M_f(t_n)) = \sum_{i=1}^{L} a_i \sin(2\pi f_i M_f(t_n) + \phi_i) \qquad (7)$$

Instead of fitting the model to the data directly, time/frequency
tracks in the STFT of the data (sequences of neighboring local maxima
over time, as described in [10]) are fit to the
instantaneous frequency of a sinusoid in Equation (7). The
instantaneous phase of sinusoid $j$ is given by

$$\varphi_j(t_n) = 2\pi f_j t_n - \frac{D f_j}{f_v} \cos(2\pi f_v (t_n + h)) + \phi_j \qquad (8)$$

and the corresponding instantaneous frequency is

$$\varphi'_j(t_n) = 2\pi f_j + 2\pi D f_j \sin(2\pi f_v (t_n + h)) \qquad (9)$$

For each time/frequency track $\omega_j[n]$, an initial estimate of
the parameters $D_j$, $f_j$, and $h_j$ is chosen to minimize the sum
of the squared errors between $\omega_j[n]$ and the modeled
instantaneous frequency $\varphi'_j(t_n)$. If the sum is taken over $N$ points
as described in the amplitude modulation case, the parameters
can also be solved for analytically, giving the optimal
parameters

$$D_j = \frac{\sqrt{\left(\sum_{n=1}^{N} \omega_j[n] \sin(2\pi f_v t_n)\right)^2 + \left(\sum_{n=1}^{N} \omega_j[n] \cos(2\pi f_v t_n)\right)^2}}{\frac{1}{2} \sum_{n=1}^{N} \omega_j[n]}$$

$$f_j = \frac{1}{2\pi N} \sum_{n=1}^{N} \omega_j[n]$$

$$h_j = \frac{1}{2\pi f_v} \arctan\left(\frac{\sum_{n=1}^{N} \omega_j[n] \cos(2\pi f_v t_n)}{\sum_{n=1}^{N} \omega_j[n] \sin(2\pi f_v t_n)}\right)$$

The final parameters are chosen from the estimate $j^*$ with the
smallest overall error, such that $D = D_{j^*}$ and $h = h_{j^*}$.

3. REMOVING VIBRATION


The signal is preprocessed by first high-pass filtering with a 
cutoff at 40 Hz to remove additive mechanical noise from the 
vibrating platform. Next the speech is filtered using a typical 
speech pre-emphasis filter, which balances the low and high 
frequencies in speech for more accurate analysis. Finally, the 
speech is separated into evenly spaced 6 ms frames and the
frames are grouped by phoneme. The following steps are per- 
formed to remove vibration from phonemes sustained for at 
least one period of vibration: (1) perform frame by frame lin- 
ear predictive analysis to extract filter coefficients and excita- 
tion; (2) estimate amplitude modulation parameters from vi- 
brated excitation and remove amplitude modulation; (3) esti- 
mate frequency modulation parameters and remove frequency 
modulation; (4) compute smoothed filter coefficients; (5) gen- 
erate recovered speech using smoothed filter coefficients and 
source excitation. These steps are described in detail below. 

For each time frame $s_{\hat n}[n]$, the coefficient vector $\tilde{\alpha}_{\hat n}$ and
the excitation $e_{\hat n}[n]$ are computed using linear predictive
analysis. The frames $e_{\hat n}[n]$ are combined using overlap-add to
form the excitation signal $e[n]$.
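The linear predictive analysis step can be sketched for a single frame as follows. This is a minimal illustration assuming the autocorrelation method and zero initial conditions; the paper does not state which LPC variant, model order, or window was used, and the function names and the synthetic AR(2) test signal are hypothetical.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC: a(1..p) with s[n] ~ sum_k a(k) s[n-k]."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation at lags 0..order.
    r = np.array([frame[:n - lag] @ frame[lag:] for lag in range(order + 1)])
    # Toeplitz normal equations R a = r[1:], with R[i, j] = r[|i - j|].
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    return np.linalg.solve(r[idx], r[1:order + 1])

def lpc_residual(frame, a):
    """Excitation e[n] = s[n] - sum_k a(k) s[n-k], zero initial conditions."""
    frame = np.asarray(frame, dtype=float)
    pred = np.zeros_like(frame)
    for k, a_k in enumerate(a, start=1):
        pred[k:] += a_k * frame[:-k]
    return frame - pred

# Synthetic check signal: a stable AR(2) process driven by white noise.
rng = np.random.default_rng(0)
w = rng.standard_normal(4000)
a_true = np.array([1.3, -0.6])
s = np.zeros_like(w)
for n in range(len(w)):
    s[n] = w[n] + sum(a_true[k] * s[n - 1 - k] for k in range(2) if n - 1 - k >= 0)

a_est = lpc_coefficients(s, order=2)   # estimated filter coefficients
e = lpc_residual(s, a_est)             # estimated excitation for this frame
```

In the full method this analysis would be repeated per short frame, with the per-frame residuals combined by overlap-add.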

The amplitude modulation model parameters $k$, $A$, and
$B$ are computed as described in Section 2.1. Given these
parameters, the amplitude modulation can be removed from
the vibrated excitation, leaving only the frequency modulated
source excitation:

$$\varepsilon(M_f(t_n)) = \frac{e[n]}{A \sin(2\pi f_v (t_n + k)) + B} \qquad (10)$$
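The closed-form amplitude modulation estimates of Section 2.1 and the division in Equation (10) can be sketched as below. This is an illustrative sketch, not the authors' implementation: the sampling rate, the random-sign stand-in for $w[n]$, and the use of `arctan2` (to resolve the quadrant that a plain arctangent leaves ambiguous) are assumptions, and the function names are hypothetical.

```python
import numpy as np

def estimate_am(y, fs, f_v):
    """Closed-form A, B, k from |y[n]| over a whole number of vibration periods."""
    mag = np.abs(np.asarray(y, dtype=float))
    t = np.arange(len(mag)) / fs
    s_sum = np.sum(mag * np.sin(2 * np.pi * f_v * t))
    c_sum = np.sum(mag * np.cos(2 * np.pi * f_v * t))
    A = 2.0 / len(mag) * np.hypot(s_sum, c_sum)
    B = np.mean(mag)
    # arctan2 is an assumption here; it keeps the correct quadrant for k.
    k = np.arctan2(c_sum, s_sum) / (2 * np.pi * f_v)
    return A, B, k

def remove_am(y, fs, f_v, A, B, k):
    """Equation (10): divide out the estimated amplitude modulator."""
    t = np.arange(len(y)) / fs
    return np.asarray(y, dtype=float) / (A * np.sin(2 * np.pi * f_v * (t + k)) + B)

fs, f_v = 8000.0, 12.0               # assumed sampling and vibration frequencies
n_samples = 2000                     # N = j * fs / f_v with j = 3 periods
rng = np.random.default_rng(1)
w = rng.choice([-1.0, 1.0], size=n_samples)   # zero mean, unit variance
t = np.arange(n_samples) / fs
y = (0.4 * np.sin(2 * np.pi * f_v * (t + 0.01)) + 1.0) * w

A, B, k = estimate_am(y, fs, f_v)
recovered = remove_am(y, fs, f_v, A, B, k)
```

Because $|w[n]| = 1$ in this toy case, the estimates match the true values exactly; for a general excitation they are scaled by a common constant, which leaves the shape of the removed envelope unchanged.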


The frequency modulation model parameters $D$ and $h$ are
then computed from $\varepsilon(M_f(t_n))$ as described in Section 2.1.
The frequency modulation is removed by frequency modulating
$\varepsilon(M_f(t_n))$ by the function

$$\frac{1}{D \sin(2\pi f_v (t_n + h)) + 1}$$

via resampling the signal in short time intervals.
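A possible sketch of the frequency modulation step follows, using the closed-form track estimates of Section 2.1 and then undoing the time warp $M_f$. The resampler here is plain linear interpolation over the whole signal (`np.interp`), which is an assumption standing in for the short-interval resampling described above; the ideal frequency track, the 200 Hz test tone, and the function names are likewise hypothetical.

```python
import numpy as np

def estimate_fm(omega, frame_rate, f_v):
    """Closed-form D, f, h from one frequency track (rad/s) over whole periods."""
    omega = np.asarray(omega, dtype=float)
    t = np.arange(len(omega)) / frame_rate
    s_sum = np.sum(omega * np.sin(2 * np.pi * f_v * t))
    c_sum = np.sum(omega * np.cos(2 * np.pi * f_v * t))
    D = np.hypot(s_sum, c_sum) / (0.5 * np.sum(omega))
    f = np.mean(omega) / (2 * np.pi)
    h = np.arctan2(c_sum, s_sum) / (2 * np.pi * f_v)  # arctan2: assumed quadrant fix
    return D, f, h

def remove_fm(x, fs, f_v, D, h):
    """Undo the time warp M_f by linear interpolation (an assumed resampler)."""
    t = np.arange(len(x)) / fs
    warped = t - D / (2 * np.pi * f_v) * np.cos(2 * np.pi * f_v * (t + h))
    # x[n] holds the source at time warped[n]; read it back on the uniform grid.
    # warped is increasing because 0 < D < 1.
    return np.interp(t, warped, x)

fs, f_v = 16000.0, 12.0
D_true, h_true, f0 = 0.05, 0.003, 200.0
t = np.arange(4000) / fs             # three vibration periods at 12 Hz

# Ideal track following the instantaneous frequency of Equation (9).
omega = 2 * np.pi * f0 * (1 + D_true * np.sin(2 * np.pi * f_v * (t + h_true)))
D, f, h = estimate_fm(omega, fs, f_v)

# Frequency modulated test tone, then its restoration.
x = np.sin(2 * np.pi * f0 * (t - D_true / (2 * np.pi * f_v)
                             * np.cos(2 * np.pi * f_v * (t + h_true))))
restored = remove_fm(x, fs, f_v, D, h)
```

Samples near the ends of the signal fall outside the warped time range and are only clamped by `np.interp`, so a practical version would need to handle frame boundaries explicitly.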

Given the recovered source excitation $\varepsilon[n]$ and the
vibrated coefficients $\tilde{\alpha}_{\hat n}$ from the original short time frames,
a partially de-vibrated speech signal $s'[n]$ is generated:

$$s'_{\hat n}[n] = \sum_{k=1}^{p} \tilde{\alpha}_{\hat n}(k)\, s'_{\hat n}[n-k] + \varepsilon_{\hat n}[n] \qquad (11)$$

Speaker   Clean    Vibrated   De-vibrated
1         69.3%    51.2%      54.7%
2         66.5%    50.4%      52.6%
3         58.0%    38.1%      40.6%
4         58.7%    46.3%      49.7%
5         50.2%    45.1%      47.4%
6         59.4%    38.4%      47.0%
Mean      60.4%    44.9%      48.7%

Table 1. Vowel classification accuracy for each speaker.

The “true” filter coefficients are estimated by performing
another round of linear predictive analysis on $s'[n]$ for
time frames $\hat m$ with length equal to the vibration period $1/f_v$.
Each resulting coefficient vector gives an estimate of a
sequence of the “true” filter coefficients $\alpha_{\hat n}$. If $\hat n_i$
and $\hat m_j$ denote the positions of the short and long time frames,
whose lengths are $T_S$ and $T_L$ respectively, then
$\alpha_{\hat n_i} \approx \alpha_{\hat m_j}$ for
all $i$ such that $j T_L < \hat n_i < (j + 1) T_L$. The final recovered
speech is given by:

$$\hat{s}_{\hat n}[n] \approx \sum_{k=1}^{p} \alpha_{\hat n}(k)\, \hat{s}_{\hat n}[n-k] + \varepsilon_{\hat n}[n] \qquad (12)$$
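The all-pole recursion shared by Equations (11) and (12) can be sketched for one frame as follows. This is a minimal illustration assuming zero initial conditions at the start of the frame; the function name `synthesize_frame` and the impulse input are hypothetical, and frame overlap handling is omitted.

```python
import numpy as np

def synthesize_frame(a, excitation):
    """Run s[n] = sum_k a(k) s[n-k] + e[n] with zero initial conditions."""
    a = np.asarray(a, dtype=float)
    e = np.asarray(excitation, dtype=float)
    s = np.zeros_like(e)
    for n in range(len(e)):
        # Most recent outputs first: s[n-1], s[n-2], ..., up to len(a) terms.
        past = s[max(0, n - len(a)):n][::-1]
        s[n] = e[n] + np.dot(a[:len(past)], past)
    return s

impulse = np.array([1.0, 0.0, 0.0, 0.0])
first_order = synthesize_frame([0.5], impulse)
second_order = synthesize_frame([1.3, -0.6], impulse)
```

Using the smoothed long-frame coefficients for every short frame that falls inside the same vibration period, as described above, would amount to calling this recursion with the same `a` for each of those frames.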

4. RESULTS 

An example of vibrated speech before and after processing 
is shown in Figure 3. The restored speech is free of ampli- 
tude, frequency and formant modulations, and is perceptually 
clearer. To test the results numerically, we ran instances of
single vowels through a classifier. We used 10 vowel classes
with MFCCs from 6 ms time frames as feature vectors. For each
speaker we trained an SVM on the other 5 speakers’ clean
vowels (~30,000 time frames total), and tested on the
target speaker’s clean (~6,000 time frames), vibrated, and
corresponding de-vibrated vowels (~25,000 time frames). Note
that this is a much smaller number of speakers than should
normally be used for this problem; this method was
used simply as a proof of concept. The results are shown in
Table 1. The average clean accuracy is low due to the small
amount of variation in the test data, but the overall trend is
still present. There is a significant drop in classification
accuracy from clean to vibrated, and a consistent improvement
from vibrated to de-vibrated. While this is by no means a
complete investigation of the effects of vibration on ASR
accuracy, these results indicate that the proposed method does
not hurt and may improve accuracy.

This method was also tested on data that was vibrated
synthetically based upon the model proposed. The parameters
are consistently estimated accurately within a small tolerance
level, and the original clean speech is restored.

Fig. 3. Spectrograms of the vibrated phoneme [a] at 12 Hz,
0.7 g (left) and restored phoneme (right).

While a limitation of this method is that it requires a 
full vibration period, it does not pose a problem in practice. 
Phonemes sustained for less than one vibration period do not 
last long enough to be audibly affected by the vibration. In 
most cases, this method introduces a small amount of
noisiness (similar to additive white noise) due to the final filter
coefficient smoothing step. However, the overall quality of the
speech is better after processing despite the addition of noise.
The noise is more apparent for male than for female
speakers.

5. CONCLUSIONS 

This paper has presented a model that helps provide a bet- 
ter understanding of the different effects of sinusoidal whole- 
body vibration on speech signals, and a method to remove the 
effect that shows promise to improve intelligibility. 

This work has focused on the ideal case where the applied 
vibration is sinusoidal at constant frequency and amplitude. 
However, in practice the vibration is not always this simple. A 
natural extension of this work would be to broaden the
vibration model and the proposed inversion method to handle
different types of vibration, such as complex or random vibration. Given
the acceleration of the environment over time (as could be
measured in real time with an accelerometer), the same model
could apply such that the amplitude, frequency, and formants
change in proportion to the acceleration.


6. REFERENCES 


[1] Michael J. Griffin, Handbook of Human Vibration,
Academic Press, 1990.

[2] ANSI S3.2-1989, Method for Measuring the Intelligibility
of Speech over Communication Systems, American
National Standards Institute, R1999.

[3] Durand Begault, “Effect of whole-body vibration on
speech, part II: Effect on intelligibility,” in Audio
Engineering Society Convention 131. Audio Engineering
Society, 2011.

[4] NASA/SP-2010-3407, NASA Human Integration Design
Handbook (HIDH), Washington, D.C.: NASA,
2010.

[5] C. W. Nixon, “Influence of selected vibrations on speech
I. Range of 10 cps to 50 cps,” The Journal of Auditory
Research, vol. 2, pp. 247-266, 1962.

[6] Charles W. Nixon and Henry C. Sommer, Influence of
Selected Vibrations upon Speech (Range of 2 cps-20 cps
and Random), Aerospace Medical Research Laboratories,
1963.

[7] C. W. Nixon and H. C. Sommer, “Influence of selected
vibrations upon speech III. Range of 6 cps to 20 cps for
semi-supine talkers,” Aerospace Medicine, vol. 34, pp.
1012-1017, 1963.

[8] Robert J. Teare, “Human hearing and speech during
whole-body vibration,” Tech. Rep., DTIC Document,
1963.

[9] Durand Begault, “Effect of whole-body vibration on
speech, part I: Stimuli recording and speech analysis,”
in Audio Engineering Society Convention 127. Audio
Engineering Society, 2009.

[10] Robert McAulay and Thomas Quatieri, “Speech
analysis/synthesis based on a sinusoidal representation,”
Acoustics, Speech and Signal Processing, IEEE
Transactions on, vol. 34, no. 4, pp. 744-754, 1986.


